Loan Default Prediction using XGBoost with explainability

Author: Tareq Galala

Data Preprocessing

Importing Libraries

In [52]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from scipy.spatial.distance import cdist
from sklearn import linear_model 
from sklearn.model_selection import cross_val_score
import sklearn.preprocessing as pp
from sklearn.preprocessing import normalize
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_tree
from xgboost import plot_importance
from numpy import loadtxt
from numpy import sort
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import minmax_scale
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
import shap
import random

sns.set(style="ticks")
%matplotlib inline

Features Description

In [53]:
# Display the data dictionary describing our variables.
Dict = pd.read_csv('Data Dictionary.csv')
display(Dict)
Variable Name Description Type
0 SeriousDlqin2yrs Person experienced 90 days past due delinquenc... Y/N
1 RevolvingUtilizationOfUnsecuredLines Total balance on credit cards and personal lin... percentage
2 age Age of borrower in years integer
3 NumberOfTime30-59DaysPastDueNotWorse Number of times borrower has been 30-59 days p... integer
4 DebtRatio Monthly debt payments, alimony,living costs di... percentage
5 MonthlyIncome Monthly income real
6 NumberOfOpenCreditLinesAndLoans Number of Open loans (installment like car loa... integer
7 NumberOfTimes90DaysLate Number of times borrower has been 90 days or m... integer
8 NumberRealEstateLoansOrLines Number of mortgage and real estate loans inclu... integer
9 NumberOfTime60-89DaysPastDueNotWorse Number of times borrower has been 60-89 days p... integer
10 NumberOfDependents Number of dependents in family excluding thems... integer

Loading the data & Inspection

In [54]:
# Reading csv files using pandas
# Setting out the first column as index
Data = pd.read_csv('cs-training.csv', index_col=0)

# pd.read_csv already returns a DataFrame; keep a working copy under a new name
DataDF = pd.DataFrame(Data)
print ('Our Data has ' + str(DataDF.shape[0]) + ' observations & ' + str(DataDF.shape[1]) + ' features.')
Our Data has 150000 observations & 11 features.
In [55]:
DataDF.head()
Out[55]:
SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0
In [56]:
# Feature names
DataDF.columns
Out[56]:
Index(['SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
       'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
       'NumberOfDependents'],
      dtype='object')
In [57]:
# Some feature names will cause problems because they contain a hyphen (-), which reads as a subtraction operator;
# we will also shorten the longer names, so the first step is to rename the columns.
DataDF.rename(columns={'NumberOfTime30-59DaysPastDueNotWorse':'T30DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse':'T60DaysLate', 'NumberOfTimes90DaysLate' : 'T90DaysLate',
                      'RevolvingUtilizationOfUnsecuredLines': 'Revolving','NumberOfOpenCreditLinesAndLoans': 'OpenCredit', 'NumberRealEstateLoansOrLines': 'RealEstate' }, inplace=True)
In [58]:
# Understanding our data for types and description
DataDF.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150000 entries, 1 to 150000
Data columns (total 11 columns):
SeriousDlqin2yrs      150000 non-null int64
Revolving             150000 non-null float64
age                   150000 non-null int64
T30DaysLate           150000 non-null int64
DebtRatio             150000 non-null float64
MonthlyIncome         120269 non-null float64
OpenCredit            150000 non-null int64
T90DaysLate           150000 non-null int64
RealEstate            150000 non-null int64
T60DaysLate           150000 non-null int64
NumberOfDependents    146076 non-null float64
dtypes: float64(4), int64(7)
memory usage: 13.7 MB
In [59]:
# Statistical info to help us understand our dataset
DataDF.describe()
Out[59]:
SeriousDlqin2yrs Revolving age T30DaysLate DebtRatio MonthlyIncome OpenCredit T90DaysLate RealEstate T60DaysLate NumberOfDependents
count 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 1.202690e+05 150000.000000 150000.000000 150000.000000 150000.000000 146076.000000
mean 0.066840 6.048438 52.295207 0.421033 353.005076 6.670221e+03 8.452760 0.265973 1.018240 0.240387 0.757222
std 0.249746 249.755371 14.771866 4.192781 2037.818523 1.438467e+04 5.145951 4.169304 1.129771 4.155179 1.115086
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.029867 41.000000 0.000000 0.175074 3.400000e+03 5.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.154181 52.000000 0.000000 0.366508 5.400000e+03 8.000000 0.000000 1.000000 0.000000 0.000000
75% 0.000000 0.559046 63.000000 0.000000 0.868254 8.249000e+03 11.000000 0.000000 2.000000 0.000000 1.000000
max 1.000000 50708.000000 109.000000 98.000000 329664.000000 3.008750e+06 58.000000 98.000000 54.000000 98.000000 20.000000

Data Imputations and Missing Values

The data provided are taken from real-world sources, so we expect errors. From my observations there are missing values, typo errors introduced at data entry, and some coded values that do not reflect the same meaning as the rest of the data.

In [60]:
# Dropping any duplicates in the data
DataDF = DataDF.drop_duplicates()

# Finding the shape of our data
print ('Our Data has ' + str(DataDF.shape[0]) + ' observations & ' + str(DataDF.shape[1]) + ' features.')
Our Data has 149391 observations & 11 features.
In [61]:
print (DataDF.shape[0] - DataDF.count())
SeriousDlqin2yrs          0
Revolving                 0
age                       0
T30DaysLate               0
DebtRatio                 0
MonthlyIncome         29221
OpenCredit                0
T90DaysLate               0
RealEstate                0
T60DaysLate               0
NumberOfDependents     3828
dtype: int64
In [62]:
# As shown above, the missing observations are MonthlyIncome: 29221 and NumberOfDependents: 3828 (percentages below are relative to the original 150000 rows).
print ('The percentage of missing values in the NumberOfDependents column is ' + str((3828/150000)* 100) + ' %.')
print ('The percentage of missing values in the MonthlyIncome column is ' + str((29221/150000)* 100) + ' %.')
The percentage of missing values in the NumberOfDependents column is 2.552 %.
The percentage of missing values in the MonthlyIncome column is 19.480666666666664 %.
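As a side note, the percentages above use hard-coded counts; they can be computed directly from the frame in one pass, which stays correct after operations like drop_duplicates(). A minimal sketch on a tiny stand-in DataFrame (the values here are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with missing values (stand-in for DataDF)
df = pd.DataFrame({
    'MonthlyIncome': [2500.0, np.nan, 4100.0, np.nan],
    'NumberOfDependents': [1.0, 0.0, np.nan, 2.0],
})

# isnull().mean() gives the fraction of NaNs per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```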
In [63]:
# The MonthlyIncome is tricky, but for now I will fill the missing values with the median.
DataDF['MonthlyIncome'].fillna(DataDF['MonthlyIncome'].median(), inplace=True)
sum(DataDF['MonthlyIncome'].isnull())
Out[63]:
0
In [64]:
# My initial analysis is that the NaN values in NumberOfDependents might mean the field
# is not applicable to the borrower, so I will fill them with the mode (which is 0).
DataDF['NumberOfDependents'].fillna(DataDF['NumberOfDependents'].mode()[0], inplace=True)
sum(DataDF['NumberOfDependents'].isnull()) 
Out[64]:
0
In [65]:
# Let's make sure we don't have any missing values left in our data
print (DataDF.shape[0] - DataDF.count())
SeriousDlqin2yrs      0
Revolving             0
age                   0
T30DaysLate           0
DebtRatio             0
MonthlyIncome         0
OpenCredit            0
T90DaysLate           0
RealEstate            0
T60DaysLate           0
NumberOfDependents    0
dtype: int64

Exploratory Data Analysis

Correlation Analysis

In [66]:
# corr() is a pandas method that produces a DataFrame of correlation coefficients for every pair of columns.
# Collinearity analysis helps us reduce the number of variables needed to explain the target variable;
# it can initially be inspected by studying pairwise correlations between features using a correlation matrix.
plt.figure(figsize=(10,8))
sns.heatmap(DataDF.corr(), annot=True, fmt=".2f", cmap='viridis')
plt.title('Pearson Correlation Plot')
plt.xticks(rotation=55)
# Bug Fixed for top and bottom cropped
b, t = plt.ylim()
b += 0.4 
t -= 0.4
plt.ylim(b, t)
plt.show()

Comment:
$T90DaysLate$, $T60DaysLate$ and $T30DaysLate$ are highly correlated. At this stage we are only exploring our variables; cleaning the data, detecting outliers and deriving new features come next. We will therefore return to Pearson correlation at a later stage, applied to the log-transformed data.

Data Distributions & Data Wrangling

In [68]:
# Here we will explore the distributions of all our features
col_num = DataDF.shape[1]
fig, axis = plt.subplots(3, 4, figsize=(18, 10))
i = 0
for row in axis:
    for col in row:
        if (i<col_num):
            DataDF.hist(column = DataDF.columns[i], bins = 20, ax=col)
            i = i+1          
In [69]:
# Here we will use pandas skew() to find the skewness of our features
num_feats = DataDF.dtypes[DataDF.dtypes!='object'].index
skew_feats=DataDF[num_feats].skew().sort_values(ascending=False)
skewness=pd.DataFrame({'Skew':skew_feats})
skewness
Out[69]:
Skew
MonthlyIncome 126.890468
Revolving 97.433211
DebtRatio 94.979721
T60DaysLate 25.424388
T90DaysLate 25.107372
T30DaysLate 24.474608
RealEstate 3.484705
SeriousDlqin2yrs 3.463772
NumberOfDependents 1.620289
OpenCredit 1.221834
age 0.192258

Comment:
A positive skewness indicates an asymmetry with a longer tail on the right-hand side of the distribution. All variables are highly skewed except age, whose near-zero skewness indicates a near-normal distribution. We will apply a log transformation to all variables except the target to lessen the skewness.
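To illustrate the claim that a log transform lessens positive skew, a minimal sketch on synthetic right-skewed data (the distribution and its parameters are arbitrary choices, not the real income data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Heavily right-skewed synthetic income-like data
raw = pd.Series(rng.lognormal(mean=8.5, sigma=1.0, size=10_000))

skew_before = raw.skew()
skew_after = np.log1p(raw).skew()   # log(x + 1), safe at x = 0

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```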

In [70]:
# Age
# Let's see how many borrowers are below 18, the legal age for borrowing.
Count_legal_age = len(DataDF.loc[DataDF["age"] < 18, "age"])
print("Total number of borrowers under the legal age {} is {}".format(18, Count_legal_age))
# Replace the zero ages (data-entry errors) with the median of the age column.
DataDF.loc[DataDF["age"] == 0, "age"] = DataDF.age.median()

# Number Of Dependents
# Looking at the stats above, I noticed NumberOfDependents has a min of 0 and a max of 20.
Dependents = len(DataDF.loc[DataDF["NumberOfDependents"] > 10, "NumberOfDependents"])
print("Total number of dependents above 10 is {}".format(Dependents))
Total number of borrowers under the legal age 18 is 1
Total number of dependents above 10 is 2
In [71]:
# Number Of Dependents
# We have two observations with 13 and 20 dependents, which look like typo errors and should be removed from our data.
DataDF.drop(DataDF[DataDF['NumberOfDependents'] == 13].index, inplace = True)
DataDF.drop(DataDF[DataDF['NumberOfDependents'] == 20].index, inplace = True)
In [72]:
# Applying log on all features and creating new columns to compare later.
# Note that a fixed constant 1 is added inside the log, i.e. log(x + 1) (equivalent to np.log1p): log(0) is undefined, and since log(1) = 0 zero values map cleanly to zero.
DataDF['Revolving_log'] = np.log(DataDF.Revolving + 1)
DataDF['age_log'] = np.log(DataDF.age + 1)
DataDF['30Days_log'] = np.log(DataDF.T30DaysLate + 1)
DataDF['DebtRatio_log'] = np.log(DataDF.DebtRatio + 1)
DataDF['MonthlyIncome_log'] = np.log(DataDF.MonthlyIncome + 1)
DataDF['OpenCredit_log'] = np.log(DataDF.OpenCredit + 1)
DataDF['RealEstate_log'] = np.log(DataDF.RealEstate + 1)
DataDF['Dependents_log'] = np.log(DataDF.NumberOfDependents + 1)
DataDF['60Days_log'] = np.log(DataDF.T60DaysLate + 1)
DataDF['90Days_log'] = np.log(DataDF.T90DaysLate + 1)

# Apply normalization; we will create different subsets later for testing.
# Min-max normalization transforms each observation into the [0, 1] range.
# The log transformation followed by normalization is a good way to handle skewed data.
# Min-Max normalization function (kept commented for reference)
#def normalize(column):
#    upper = column.max()
#    lower = column.min()
#    y = (column - lower)/(upper-lower)
#    return y

# Using our function above we can normalize our features using min max
#DataDF['RevolvingUtilizationOfUnsecuredLines_log_normalized'] = normalize(DataDF.RevolvingUtilizationOfUnsecuredLines_log)
#DataDF['age_log_normalized'] = normalize(DataDF.age_log)
#DataDF['NumberOfTimes30DaysLate_log_normalized'] = normalize(DataDF.NumberOfTimes30DaysLate_log)
#DataDF['DebtRatio_log_normalized'] = normalize(DataDF.DebtRatio_log)
#DataDF['MonthlyIncome_log_normalized'] = normalize(DataDF.MonthlyIncome_log)
#DataDF['NumberOfOpenCreditLinesAndLoans_log_normalized'] = normalize(DataDF.NumberOfOpenCreditLinesAndLoans_log)
#DataDF['NumberRealEstateLoansOrLines_log_normalized'] = normalize(DataDF.NumberRealEstateLoansOrLines_log)
#DataDF['NumberOfDependents_log_normalized'] = normalize(DataDF.NumberOfDependents_log)
#DataDF['NumberOfTimes60DaysLate_log_normalized'] = normalize(DataDF.NumberOfTimes60DaysLate_log)
#DataDF['NumberOfTimes90DaysLate_log_normalized'] = normalize(DataDF.NumberOfTimes90DaysLate_log)
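The commented-out min-max helper above can be sketched as a standalone function (this mirrors what sklearn's minmax_scale, imported earlier, would do; the example values are illustrative):

```python
import pandas as pd

def normalize(column: pd.Series) -> pd.Series:
    """Min-max scale a column into the [0, 1] range."""
    lower, upper = column.min(), column.max()
    return (column - lower) / (upper - lower)

scaled = normalize(pd.Series([2.0, 4.0, 6.0, 10.0]))
print(scaled.tolist())  # → [0.0, 0.25, 0.5, 1.0]
```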
In [73]:
DataDF.head().T
Out[73]:
1 2 3 4 5
SeriousDlqin2yrs 1.000000 0.000000 0.000000 0.000000 0.000000
Revolving 0.766127 0.957151 0.658180 0.233810 0.907239
age 45.000000 40.000000 38.000000 30.000000 49.000000
T30DaysLate 2.000000 0.000000 1.000000 0.000000 1.000000
DebtRatio 0.802982 0.121876 0.085113 0.036050 0.024926
MonthlyIncome 9120.000000 2600.000000 3042.000000 3300.000000 63588.000000
OpenCredit 13.000000 4.000000 2.000000 5.000000 7.000000
T90DaysLate 0.000000 0.000000 1.000000 0.000000 0.000000
RealEstate 6.000000 0.000000 0.000000 0.000000 1.000000
T60DaysLate 0.000000 0.000000 0.000000 0.000000 0.000000
NumberOfDependents 2.000000 1.000000 0.000000 0.000000 0.000000
Revolving_log 0.568789 0.671490 0.505721 0.210107 0.645657
age_log 3.828641 3.713572 3.663562 3.433987 3.912023
30Days_log 1.098612 0.000000 0.693147 0.000000 0.693147
DebtRatio_log 0.589442 0.115002 0.081684 0.035415 0.024620
MonthlyIncome_log 9.118335 7.863651 8.020599 8.101981 11.060196
OpenCredit_log 2.639057 1.609438 1.098612 1.791759 2.079442
RealEstate_log 1.945910 0.000000 0.000000 0.000000 0.693147
Dependents_log 1.098612 0.693147 0.000000 0.000000 0.000000
60Days_log 0.000000 0.000000 0.000000 0.000000 0.000000
90Days_log 0.000000 0.000000 0.693147 0.000000 0.000000

Variable Exploration

Understanding our variables is vital for analysis. I discussed the data with people working in finance to gain domain knowledge, understand the variables, and try to derive more features later.

Variable: Revolving
In [74]:
# Description: Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits

# Let's see how many borrowers have a total balance more than three times their credit limit.
# Looking at the original data for RevolvingUtilizationOfUnsecuredLines we notice that there are a lot of outliers. The log transformation solved this.
# Count_Revolving = len(DataDF.loc[DataDF["RevolvingUtilizationOfUnsecuredLines"] > 3, "RevolvingUtilizationOfUnsecuredLines"])
# print(Count_Revolving)
# Dropping observations 
# DataDF.drop(DataDF[DataDF['RevolvingUtilizationOfUnsecuredLines'] > 3].index, inplace = True)
In [75]:
# Plot histograms with a kernel density estimate, comparing the variable with its log transformation
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Revolving Utilization Of Unsecured Lines", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "Revolving"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "Revolving"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "Revolving_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "Revolving_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();
Variable: Age
In [76]:
# Description: Age of borrower in years
# Plot histograms with a kernel density estimate, comparing the variable with its log transformation
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Age", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "age"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "age"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "age_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "age_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();

# QQ plot to compare our variable with a theoretical normal distribution
sm.qqplot(DataDF['age'], dist=stats.norm, line = 'r')
plt.show()

Comment:
Looking at the histogram above for age, we notice a nearly normal distribution; the Q-Q plot confirms this. Since both distributions are close, I opted for $age\_log$ in the analysis so that all features follow the same transformation.

Variable: 30DaysLate
In [77]:
# Description: Number of times borrower has been 30-59 days past due but no worse in the last 2 years.

# The maximum possible total for this variable in two years is 24, which would mean the borrower was late every single month.
Count_30DaysLate_Outliers = len(DataDF.loc[DataDF["T30DaysLate"] > 24])
print("Total number of borrowers above 24 times 30 days late is {}".format(Count_30DaysLate_Outliers))
Total number of borrowers above 24 times 30 days late is 225
In [78]:
# Plot histograms with a kernel density estimate, comparing the variable with its log transformation
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Number Of Times 30 Days Late", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "T30DaysLate"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "T30DaysLate"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "30Days_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "30Days_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();
Variable: Debt Ratio
In [79]:
# Description: Monthly debt payments, alimony,living costs divided by monthy gross income.
# We can summarize this in an equation: Debt Ratio = Monthly debt payments / Monthly gross income
In [80]:
# Checking outliers - values above the 99.9th percentile
Count_DebtRatio_Outliers = len(DataDF.loc[DataDF["DebtRatio"] > DataDF["DebtRatio"].quantile(0.999)])
print("Total number of borrowers with Debt Ratio above the 99.9th percentile is {}".format(Count_DebtRatio_Outliers))
Total number of borrowers with Debt Ratio above the 99.9th percentile is 150
In [81]:
# Plot histograms with a kernel density estimate, comparing the variable with its log transformation
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Debt Ratio", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "DebtRatio"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "DebtRatio"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "DebtRatio_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "DebtRatio_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();

Comment:

Looking at the graph above, $DebtRatio$ looks reasonable now. As expected the distribution is still not normal even after the log transformation, but the transformation helped with the outliers and the data is less skewed.

Variable: Monthly Income
In [82]:
# Description: Monthly income
In [83]:
# Plot histograms with a kernel density estimate, comparing the variable with its log transformation
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Monthly Income", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "MonthlyIncome"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "MonthlyIncome"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "MonthlyIncome_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "MonthlyIncome_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();
Variable: OpenCredit
In [84]:
# Description: Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards).
In [85]:
# Plot histograms with a kernel density estimate, comparing the variable with its log transformation
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Number Of Open Credit Lines And Loans", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "OpenCredit"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "OpenCredit"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "OpenCredit_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "OpenCredit_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();
Variable: 90DaysLate
In [86]:
# Description: Number of times borrower has been 90 days or more past due.
In [87]:
# Plot histograms with a kernel density estimate, comparing the variable with its log transformation
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Number Of Times 90 Days Late", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "T90DaysLate"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "T90DaysLate"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "90Days_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "90Days_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();
Variable: RealEstate
In [88]:
# Description: Number of mortgage and real estate loans including home equity lines of credit.
In [89]:
# Plot histograms with a kernel density estimate, comparing the variable with its log transformation
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Number Real Estate Loans Or Lines", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "RealEstate"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "RealEstate"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "RealEstate_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "RealEstate_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();
Variable: 60DaysLate
In [90]:
# Description: Number of times borrower has been 60-89 days past due but no worse in the last 2 years.
In [91]:
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Number Of Times 60 Days Late", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "T60DaysLate"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "T60DaysLate"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "60Days_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "60Days_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();
Variable: NumberOfDependents
In [92]:
# Description: Number of dependents in family excluding themselves (spouse, children etc.)
In [93]:
# Plot histograms with a kernel density estimate, comparing the variable with its log transformation
fig = plt.figure(figsize=(18,4))
title = fig.suptitle("Number Of Dependents", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.2)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Original Data")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "NumberOfDependents"], bins=20, color="blue", label="Non Defaults", hist_kws={'alpha':.7}, kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "NumberOfDependents"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5}, kde_kws={'linewidth':1})
plt.legend();

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Histogram of Log transformed")
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 0, "Dependents_log"], bins=20, color="blue", hist_kws={'alpha':.7}, label="Non Defaults",kde_kws={'linewidth':1})
sns.distplot(DataDF.loc[DataDF['SeriousDlqin2yrs'] == 1, "Dependents_log"] , bins=20, color="green", label="Defaults", hist_kws={'alpha':.5},kde_kws={'linewidth':1})
plt.legend();
In [94]:
# After applying the log transformation, all features have a better distribution and are less skewed.
# In addition, the log transformation took care of our outliers.
# We will create a new dataset with the new features.
NewDataDF = DataDF.drop(['T30DaysLate', 'DebtRatio', 'MonthlyIncome',
                         'OpenCredit', 'T90DaysLate', 'RealEstate',
                         'T60DaysLate', 'NumberOfDependents', 'age', 'Revolving'],axis=1,inplace=False)
In [95]:
NewDataDF.head().T
Out[95]:
1 2 3 4 5
SeriousDlqin2yrs 1.000000 0.000000 0.000000 0.000000 0.000000
Revolving_log 0.568789 0.671490 0.505721 0.210107 0.645657
age_log 3.828641 3.713572 3.663562 3.433987 3.912023
30Days_log 1.098612 0.000000 0.693147 0.000000 0.693147
DebtRatio_log 0.589442 0.115002 0.081684 0.035415 0.024620
MonthlyIncome_log 9.118335 7.863651 8.020599 8.101981 11.060196
OpenCredit_log 2.639057 1.609438 1.098612 1.791759 2.079442
RealEstate_log 1.945910 0.000000 0.000000 0.000000 0.693147
Dependents_log 1.098612 0.693147 0.000000 0.000000 0.000000
60Days_log 0.000000 0.000000 0.000000 0.000000 0.000000
90Days_log 0.000000 0.000000 0.693147 0.000000 0.000000
In [96]:
# Let's have a look at pair plots for our log transformed variables with target hue
g = sns.pairplot(NewDataDF, vars=["Revolving_log", "age_log", "30Days_log", "DebtRatio_log", "MonthlyIncome_log",
                                  "OpenCredit_log", "RealEstate_log","60Days_log", "90Days_log"], hue="SeriousDlqin2yrs")
Target: SeriousDlqin2yrs (Response Column)
In [ ]:
# Description: This is our response column (the y column), recording whether the borrower defaulted: 0 for No, 1 for Yes.
In [98]:
# Looking at the ratio in our response column SeriousDlqin2yrs
fig = plt.figure(figsize=(8,4))
default = sns.countplot(x = DataDF.SeriousDlqin2yrs ,palette='muted')
default.set_xlabel('Defaults:  0 (No), 1 (Yes)')
default.set_ylabel('Count')
default.set_title('Serious delinquency 2 yrs')
plt.show()
In [99]:
[N,Y] = DataDF.SeriousDlqin2yrs.value_counts()
print("Total observations: %d" % (Y+N))
print("Number of people who defaulted: %d" % (Y))
print('Default: ' + str((Y/(N+Y))*100) + ' %')
Total observations: 149389
Number of people who defaulted: 10009
Default: 6.699957828220284 %

Comments: Our dataset is imbalanced, i.e. the classes are unequally distributed. Possible solutions include undersampling the majority class or oversampling the minority class (e.g. SMOTE). XGBoost, however, handles this through its scale_pos_weight parameter.
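A sketch of how scale_pos_weight could be set from the class counts above (the common starting point is the negative/positive ratio; the classifier call shown in the comment is illustrative, with all other hyperparameters omitted):

```python
# Class counts from value_counts() above: non-defaults and defaults
n_neg, n_pos = 139380, 10009

# XGBoost's suggested starting point: sum(negatives) / sum(positives)
scale_pos_weight = n_neg / n_pos
print(round(scale_pos_weight, 2))  # → 13.93

# Then, as a sketch:
# model = XGBClassifier(scale_pos_weight=scale_pos_weight)
```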

Outliers

In [100]:
# We are cautious about removing more observations using the methods below; since we applied the log transformation, there is no need to.
# The techniques did, however, help us in our analysis.
# Single Variables
# Finding outliers by finding the mean and standard deviation
mean = DataDF['T60DaysLate'].mean()
std = DataDF['T60DaysLate'].std()
upper = mean + 4*std
lower = mean - 4*std
print('Outliers upper limit: '+str(upper))
print('Outliers lower limit: '+str(lower))

#check how many outliers there are
isOutlier = ((DataDF['T60DaysLate']>upper)|(DataDF['T60DaysLate']<lower))
print('outlier count:',(isOutlier*1).sum())
Outliers upper limit: 15.454700406078242
Outliers lower limit: -15.029689193739983
outlier count: 225
In [101]:
# Multiple variable outlier detection using the Mahalanobis distance
# First, set all the columns we want to examine
cols = ['SeriousDlqin2yrs', 'Revolving_log', 'age_log',
       '30Days_log', 'DebtRatio_log', 'MonthlyIncome_log',
       'OpenCredit_log', '90Days_log',
       'RealEstate_log', '60Days_log',
       'Dependents_log']
# Generating the mean vector
mean_Vector = NewDataDF[cols].mean().values.reshape(1, len(cols))
# Calculating Mahalanobis distance for each row to the mean vector
mahalanobis_Distances  = cdist(NewDataDF[cols], mean_Vector, metric='mahalanobis').flatten()
# Creating a scatter plot where we use a color mapping and use the mahalanobis distances above
plt.figure(num=8, figsize=(10, 6), linewidth=3)
sns.scatterplot(x=NewDataDF['Revolving_log'], y=NewDataDF['MonthlyIncome_log'], hue=mahalanobis_Distances,  palette='plasma')
plt.suptitle('Outliers using Mahalanobis distance')
plt.xlabel('Revolving Utilization Of Unsecured Lines Log transformed')
plt.ylabel('Monthly Income Log transformed')
Out[101]:
Text(0, 0.5, 'Monthly Income Log transformed')
In [102]:
# Since all our features are log transformed, we are not removing any outliers found by the techniques above; they were still useful for our analysis.

Age Relationship with target variable

In [103]:
# Drop outliers in age above 90 years old
DataDF.drop(DataDF[DataDF['age'] > 90].index, inplace = True)
# Plot data and a linear regression model fit.
grouped_by_age = DataDF.groupby('age')
ageDlqCount = grouped_by_age['SeriousDlqin2yrs'].aggregate([np.mean,'count']).reset_index()
ageDlqCount.columns =['Age','DlqMean','Count']
print(ageDlqCount)
sns.regplot(x='Age',y='DlqMean',data=ageDlqCount,  marker="+")
plt.show()
     Age   DlqMean  Count
0   21.0  0.074074    162
1   22.0  0.095109    368
2   23.0  0.114865    592
3   24.0  0.122605    783
4   25.0  0.126866    938
..   ...       ...    ...
65  86.0  0.015000    400
66  87.0  0.022792    351
67  88.0  0.025974    308
68  89.0  0.032847    274
69  90.0  0.015464    194

[70 rows x 3 columns]

Comment:
Defaults are negatively correlated with age in general. US studies found that unsecured borrowing and debt levels have increased over the years for the younger generation. Younger people may make poorer borrowing decisions and are more likely to default than older adults, who tend to be more financially aware and less likely to miss payments. For these reasons, loan rates are much higher for younger people, who present a riskier profile. (REF: BORROWING BEHAVIOUR)

Feature Engineering

Feature engineering is very useful for analysis, so we will create new features by combining, aggregating, and transforming existing ones so our models can identify more patterns. First, since household expenses vary with the number of dependents, we derive income per member by dividing income by the number of dependents. Second, multiplying the existing debt ratio by income gives monthly debt, the borrower's outstanding monthly obligation. Third, monthly balance follows naturally by deducting that debt from income. We will derive these new features and examine how important they are in our modelling.

In [104]:
# Income per member: the monthly income divided by the number of dependents
NewDataDF['IncomeEachMember'] = NewDataDF['MonthlyIncome_log'] / (NewDataDF['Dependents_log'] + 1)

# Monthly debt: the monthly income multiplied by the debt ratio
NewDataDF['MonthlyDebt'] = NewDataDF['MonthlyIncome_log']*NewDataDF['DebtRatio_log']

# Monthly balance: what remains after deducting the monthly debt from the monthly income
NewDataDF['MonthlyBalance'] = NewDataDF['MonthlyIncome_log']-NewDataDF['MonthlyDebt']
In [105]:
DataDF.columns
Out[105]:
Index(['SeriousDlqin2yrs', 'Revolving', 'age', 'T30DaysLate', 'DebtRatio',
       'MonthlyIncome', 'OpenCredit', 'T90DaysLate', 'RealEstate',
       'T60DaysLate', 'NumberOfDependents', 'Revolving_log', 'age_log',
       '30Days_log', 'DebtRatio_log', 'MonthlyIncome_log', 'OpenCredit_log',
       'RealEstate_log', 'Dependents_log', '60Days_log', '90Days_log'],
      dtype='object')
In [106]:
# Let's plot a heatmap to show the correlation of our target variable SeriousDlqin2yrs with the other features
corr = NewDataDF.corr()
corr_delinquency=corr[['SeriousDlqin2yrs']]

# Correlation sorting
corr_delinquency=corr_delinquency.sort_values(by ='SeriousDlqin2yrs',ascending=False)
plt.figure(figsize = (10,10))

import seaborn as sns
ax= sns.heatmap(corr_delinquency, vmin=-1, cmap='viridis', vmax=1, center=0,  square=True, annot=True, fmt=".2f", )
bottom, top = ax.get_ylim()
ax.set_title("Correlation SeriousDlqin2yrs vs other features")
fig.tight_layout()

# Bug Fixed for top and bottom cropped
ax.set_ylim(bottom + 0.5, top - 0.5)
Out[106]:
(14.0, 0.0)
In [107]:
# The above is a great way to find correlation with the target variable. Let's see if we can confirm this in the modelling section.

Modelling

Our goal here is to measure the importance of our features and perform feature selection. We will use two models: a parametric and a non-parametric classifier. Parametric models make assumptions about the data distribution, whereas boosting trees have parameters related to the algorithm itself but need no assumptions about the distribution. Modelling the data in different ways helps identify which variables matter, which can support early detection and perhaps even improve the product or service. I will set the random state to ensure results are comparable.

In [108]:
# split data into our features (X) and target output (y)
X = NewDataDF.drop(['SeriousDlqin2yrs'],axis=1,inplace=False)
y = NewDataDF['SeriousDlqin2yrs']

# Split our data into 80% training set and 20% testing. 
# We will define random state to number 123, so we can get the same output every time
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y, test_size=0.2, random_state=123)

Logistic Regression Model

We will use logistic regression as it is well suited to binary classification problems. We will use its coefficients to learn more about the features in pursuit of answering our research questions. The coefficients will be estimated from our training data by maximum likelihood.

Output from statsmodels
In [109]:
import statsmodels.api as sm
ModelLogit = sm.Logit(y,X)

result=ModelLogit.fit(maxiter=1000)
print(result.summary())
did_converge = result.mle_retvals["converged"]
Optimization terminated successfully.
         Current function value: 0.200463
         Iterations 14
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       SeriousDlqin2yrs   No. Observations:               149389
Model:                          Logit   Df Residuals:                   149377
Method:                           MLE   Df Model:                           11
Date:                Tue, 10 Dec 2019   Pseudo R-squ.:                  0.1845
Time:                        12:54:38   Log-Likelihood:                -29947.
converged:                       True   LL-Null:                       -36721.
Covariance Type:            nonrobust   LLR p-value:                     0.000
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Revolving_log         0.5369      0.019     28.184      0.000       0.500       0.574
age_log              -0.8169      0.032    -25.713      0.000      -0.879      -0.755
30Days_log            1.1784      0.024     48.569      0.000       1.131       1.226
DebtRatio_log        -0.1419      0.027     -5.280      0.000      -0.195      -0.089
MonthlyIncome_log    -0.0769        nan        nan        nan         nan         nan
OpenCredit_log       -0.0294      0.022     -1.313      0.189      -0.073       0.014
RealEstate_log        0.0475      0.026      1.793      0.073      -0.004       0.099
Dependents_log        0.5432      0.088      6.151      0.000       0.370       0.716
60Days_log            0.8303      0.040     20.532      0.000       0.751       0.910
90Days_log            1.5558      0.032     48.882      0.000       1.493       1.618
IncomeEachMember      0.0964      0.023      4.179      0.000       0.051       0.142
MonthlyDebt          -0.0304        nan        nan        nan         nan         nan
MonthlyBalance       -0.0466        nan        nan        nan         nan         nan
=====================================================================================
Output from scikit-learn

We will fit a model below using liblinear, an open-source library that supports logistic regression. Our experiments show it is very efficient on large-scale linear classification datasets.

In [110]:
# Fit a logistic regression model with liblinear and make predictions on test set
ModelLog = LogisticRegression(solver='liblinear', random_state=123)
ModelLog.fit(Xtrain,Ytrain)
Out[110]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=123, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
In [111]:
# Use cross val score method to get accuracy of model
#score = ModelLogSci.score(Xtest, Ytest)
#print(score)
scores = cross_val_score(ModelLog,X,y,cv=10)
print('Average score across all folds is: ',scores.mean()*100,'%')
Average score across all folds is:  93.56913863242002 %
In [112]:
# Predictions on test data (We opted for AUC score comparison since our data has imbalanced classes)
# Accuracy
Ypred_ModelLog = ModelLog.predict(Xtest)
#accuracy_ModelLog = accuracy_score(Ytest, Ypred_ModelLog)
#print("Logistic regression test accuracy: %.3f" % (accuracy_ModelLog * 100.0))
In [113]:
# Predictions on test data (We opted for AUC score comparison since our data has imbalanced classes)
# AUC
log_pred = ModelLog.predict_proba(Xtest)[:,1]
roc_log = roc_auc_score(Ytest,log_pred)
print("Logistic regression test AUC: %.3f" % (roc_log))
Logistic regression test AUC: 0.835
In [114]:
# Plot our coefficients
FeatureC = pd.DataFrame(list(Xtrain.columns),columns=['features'])
FeatureC['coefficients'] = pd.DataFrame(list(ModelLog.coef_[0]),columns=['coefficients'])
FeatureC['coefficients'] = FeatureC['coefficients'].apply(lambda x: float(format(x, 'f')))
FeatureC.sort_values(by='coefficients',ascending=True).plot(x='features',y='coefficients', kind='barh')
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bd8856c888>
In [115]:
print(FeatureC)
             features  coefficients
0       Revolving_log      0.507449
1             age_log     -1.078067
2          30Days_log      1.174968
3       DebtRatio_log     -0.251210
4   MonthlyIncome_log     -0.114842
5      OpenCredit_log     -0.010325
6      RealEstate_log      0.133764
7      Dependents_log      0.428606
8          60Days_log      0.776593
9          90Days_log      1.571732
10   IncomeEachMember      0.072587
11        MonthlyDebt     -0.042978
12     MonthlyBalance     -0.071864
In [116]:
confusionMatrix = confusion_matrix(Ytest, Ypred_ModelLog)
print(confusionMatrix)
[[27646   190]
 [ 1772   270]]

Comment:
From the sklearn.linear_model.LogisticRegression documentation, the feature coefficients enter the decision function and are log-odds. The output from statsmodels suggests the highest positive coefficient is the $90Days_log$ feature, followed by $30Days_log$, $60Days_log$, $Revolving_log$ and $RealEstate_log$, all positive, meaning the probability of financial distress increases with these features. The scikit-learn graph confirms this. $age_log$ has a negative coefficient, supporting the earlier finding that financial distress decreases with age. The coefficient values are not identical in the two outputs, as scikit-learn applies regularization and different optimizations. The accuracy (93.5%) is high and the model generalized well under cross validation.
From the confusion matrix, TN = 27646, FP = 190, FN = 1772, TP = 270. However, the recall is very low because the data is imbalanced.
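Since the coefficients are log-odds, exponentiating them gives odds ratios, which are often easier to communicate. A minimal sketch using the statsmodels coefficients reported above (values copied from the summary table):

```python
import numpy as np

# Converting the reported log-odds coefficients into odds ratios:
# a one-unit increase in 90Days_log multiplies the odds of default by
# exp(1.5558) ~ 4.74, while a one-unit increase in age_log shrinks them.
coefs = {'90Days_log': 1.5558, '30Days_log': 1.1784, 'age_log': -0.8169}
odds_ratios = {name: np.exp(beta) for name, beta in coefs.items()}
for name, orat in odds_ratios.items():
    print(f"{name}: odds ratio {orat:.2f}")
```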

Extreme Gradient Boosting Model

XGBoost uses decision trees as base learners within a gradient boosting framework. The algorithm provides built-in feature importance estimates, computed automatically from a trained model, which we will use to explore the importance of our features.

In [117]:
# Fit an XGB classifier
# (Note: fitted on the full dataset here, so the test AUC below is optimistic)
modelXGB = XGBClassifier()
modelXGB.fit(X, y)
# Explore feature importance
print(modelXGB.feature_importances_)
# Feature importance plot
plot_importance(modelXGB)
plt.show()
[0.22617169 0.04028863 0.12964267 0.01396231 0.02062026 0.02182838
 0.02950956 0.0227084  0.11525472 0.32205224 0.01644189 0.02363533
 0.01788382]
In [118]:
# Predictions on test data (We opted for AUC score comparison since our data has imbalanced classes)
# Accuracy
#Ypred = modelXGB.predict(Xtest)
#accuracy = accuracy_score(Ytest, Ypred)
#print("XGBoost test accuracy: %.3f" % (accuracy * 100.0))
In [119]:
# Predictions on test data (We opted for AUC score comparison since our data has imbalanced classes)
# AUC
Ypred_AUC = modelXGB.predict_proba(Xtest)[:,1]
roc_XGB = roc_auc_score(Ytest,Ypred_AUC)
print("XGBoost test AUC: %.3f" % (roc_XGB))
XGBoost test AUC: 0.871

Comment:
The graph above shows feature importance: it counts the number of times each feature is split on across all trees in the model and visualizes the result as a bar chart, with features ordered by that count. We can observe that $Revolving_log$ is the most important here, whereas in logistic regression $90Days_log$ was. Now we know which features are important, but how many should we select? See Feature selection.

In [120]:
plt.figure(figsize=(10,8))
sns.heatmap(NewDataDF.corr(), annot=True, fmt=".2f", cmap='viridis')
plt.title('Pearson Correlation Plot')
plt.xticks(rotation=55)
# Bug Fixed for top and bottom cropped
b, t = plt.ylim() # discover the values for bottom and top
b += 0.4 
t -= 0.4
plt.ylim(b, t)

plt.show()

Comment:
From the Pearson plot above we notice that $MonthlyDebt$ is very highly correlated with $MonthlyBalance$, as are $DebtRatio_log$ with $MonthlyBalance$ and $Dependents_log$ with $IncomeEachMember$. It is advisable to remove one of each such pair, but we will complete our analysis first.
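One common way to act on such a plot is to drop one feature from each pair whose absolute correlation exceeds a threshold. A minimal sketch on toy data (the values below are made up; only the column names echo the engineered features discussed above):

```python
import numpy as np
import pandas as pd

# Toy frame where MonthlyBalance is income minus debt, hence nearly collinear
# with income.
df = pd.DataFrame({
    'MonthlyIncome_log': [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    'MonthlyDebt':       [0.9, 0.6, 0.7, 1.1, 0.8, 1.0],
})
df['MonthlyBalance'] = df['MonthlyIncome_log'] - df['MonthlyDebt']

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # ['MonthlyBalance']
```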

XGB Pipeline with Grid Search & Hyper parameter tuning
In [122]:
# split data into our features (X) and target output (Y)
X = NewDataDF.drop(['SeriousDlqin2yrs'],axis=1,inplace=False)
y = NewDataDF['SeriousDlqin2yrs']

# Split our data into 80% training set and 20% testing. 
# We will define random state to number 123, so we can get the same output every time
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y, test_size=0.2, random_state=123)
In [159]:
# Parameter tuning: we started with a random search to get rough parameters,
# then used a grid search over a range around the random search results and kept the best parameters by max AUC.
# We leave only the selected parameters so the kernel doesn't rerun a long search on every refresh.

# Random Search Experiment
# randomized_model = RandomizedSearchCV(estimator=pipeline,param_distributions=gbm_param_grid,n_iter=2, scoring='roc_auc', cv=2, verbose=1)
# Fit the estimator
# randomized_model.fit(X, y)

# Parameter grid
# We selected the following parameters from hundreds of parameter combinations
parameters_grid = {
    'clf__loss' : ['deviance'], # 'exponential'
    'clf__learning_rate': [0.1],
    'clf__max_depth': [3],
    'clf__n_estimators': [160],
    'clf__subsample':[0.9],
    'clf__scale_pos_weight':[7]} # Deals with class imbalance

# Create pipeline (Very useful when using one-hot encoding and label encoding for categorical features)
pipeline = Pipeline([
                     ("clf", xgb.XGBClassifier())])

# Grid search to tune our hyperparameters with 10-fold cross validation
XGB_Grid = GridSearchCV(param_grid=parameters_grid, 
                        estimator=pipeline, 
                        scoring="roc_auc", #"roc_auc" for AUC & "accuracy" for accuracy
                        cv=2, verbose=1) # Return cv to 10 after testing

# Fitting our model
XGB_Grid.fit(X,y)

# Computing metrics (AUC on cross validation set, not to compare with the test)
print(XGB_Grid.best_score_)
print(XGB_Grid.best_params_)
Fitting 2 folds for each of 1 candidates, totalling 2 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   12.6s finished
0.8635881549456706
{'clf__learning_rate': 0.1, 'clf__loss': 'deviance', 'clf__max_depth': 3, 'clf__n_estimators': 160, 'clf__scale_pos_weight': 7, 'clf__subsample': 0.9}
In [124]:
pipeline.get_params().keys()
Out[124]:
dict_keys(['memory', 'steps', 'verbose', 'clf', 'clf__base_score', 'clf__booster', 'clf__colsample_bylevel', 'clf__colsample_bynode', 'clf__colsample_bytree', 'clf__gamma', 'clf__learning_rate', 'clf__max_delta_step', 'clf__max_depth', 'clf__min_child_weight', 'clf__missing', 'clf__n_estimators', 'clf__n_jobs', 'clf__nthread', 'clf__objective', 'clf__random_state', 'clf__reg_alpha', 'clf__reg_lambda', 'clf__scale_pos_weight', 'clf__seed', 'clf__silent', 'clf__subsample', 'clf__verbosity'])
In [125]:
# Fit the best parameters in our XGB boost model
# XGBoost can deal with imbalanced data via the 'scale_pos_weight' parameter, which can be calculated as sum(negative instances) / sum(positive instances).
# The parameter assigns a higher weight to the positive cases (1).
# A scale of 7 gave the best AUC, although accuracy fell drastically; we keep it at 7 here.
XGB_Best_Model = XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               loss='deviance', max_delta_step=0, max_depth=3,
                               min_child_weight=1, missing=None,
                               n_estimators=160, n_jobs=1, nthread=None,
                               objective='binary:logistic', random_state=123,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=7,
                               seed=None, silent=None, subsample=0.9,
                               verbosity=1)

XGB_Best_Model.fit(X, y)
# Explore feature importance
print(XGB_Best_Model.feature_importances_)
# Feature importance plot
plot_importance(XGB_Best_Model)
plt.show()
[0.32444224 0.02886919 0.18953583 0.01276869 0.01417304 0.0198583
 0.03282939 0.01287184 0.13954973 0.1910574  0.0087625  0.01444518
 0.01083665]
In [126]:
# Predictions on test data (We opted for AUC score comparision since our data has imbalanced classes)
# Accuracy
#Ypred = XGB_Best_Model.predict(Xtest)
#accuracy = accuracy_score(Ytest, Ypred)
#print("Accuracy:%.3f" % (accuracy * 100.0))
In [127]:
# Predictions on test data (We opted for AUC score comparision since our data has imbalanced classes)
# AUC
Ypred_AUC = XGB_Best_Model.predict_proba(Xtest)[:,1]
roc_XGB = roc_auc_score(Ytest,Ypred_AUC)
print("XGBoost test AUC: %.3f" % (roc_XGB))
XGBoost test AUC: 0.875
In [128]:
# Feature selection based on feature importance scores
# AUC is calculated on each subset of n features
threshold = sort(XGB_Best_Model.feature_importances_)
for i in threshold:
        # select features using threshold
        ModelSelected = SelectFromModel(XGB_Best_Model, threshold=i, prefit=True)
        XtrainSelected = ModelSelected.transform(X)
        # train model
        selection_model = XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               loss='deviance', max_delta_step=0, max_depth=3,
                               min_child_weight=1, missing=None,
                               n_estimators=160, n_jobs=1, nthread=None,
                               objective='binary:logistic', random_state=123,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=7,
                               seed=None, silent=None, subsample=0.9,
                               verbosity=1)
        
        selection_model.fit(XtrainSelected, y)
        # Testing our test set and evaluating our model
        XtestSelected = ModelSelected.transform(Xtest)
        y_pred = selection_model.predict(XtestSelected)
        y_proba = selection_model.predict_proba(XtestSelected)[:,1]
        roc = roc_auc_score(Ytest,y_proba)
        accuracy = accuracy_score(Ytest, y_pred)
        print("Threshold=%.3f, n of features=%d, AUC:%.3f, Accuracy:%.2f%%" % (i, XtrainSelected.shape[1], roc, accuracy*100.0))
Threshold=0.009, n of features=13, AUC:0.875, Accuracy:88.65%
Threshold=0.011, n of features=12, AUC:0.875, Accuracy:88.65%
Threshold=0.013, n of features=11, AUC:0.875, Accuracy:88.56%
Threshold=0.013, n of features=10, AUC:0.875, Accuracy:88.63%
Threshold=0.014, n of features=9, AUC:0.874, Accuracy:88.62%
Threshold=0.014, n of features=8, AUC:0.873, Accuracy:88.55%
Threshold=0.020, n of features=7, AUC:0.871, Accuracy:88.59%
Threshold=0.029, n of features=6, AUC:0.868, Accuracy:88.90%
Threshold=0.033, n of features=5, AUC:0.865, Accuracy:88.88%
Threshold=0.140, n of features=4, AUC:0.861, Accuracy:88.82%
Threshold=0.190, n of features=3, AUC:0.851, Accuracy:88.75%
Threshold=0.191, n of features=2, AUC:0.827, Accuracy:88.41%
Threshold=0.324, n of features=1, AUC:0.795, Accuracy:82.29%

Comment:
The model's AUC starts at 0.875 with all 13 features and stays there down to 10 features (threshold 0.013), then generally decreases with the number of selected features. Our goal is the best AUC (0.875) with the fewest features, so we selected the top 10 features from the chart above. We can say we successfully selected our most important features. The two new features MonthlyDebt and MonthlyBalance performed well and made the top 10.

Feature Selection

Based on the above we opted for the top 10 features

In [129]:
# Our final dataframe: only the top 10 selected features.
FinalDataDF = NewDataDF.drop(['IncomeEachMember', 'Dependents_log', 'RealEstate_log'],axis=1,inplace=False)
In [130]:
FinalDataDF.columns
Out[130]:
Index(['SeriousDlqin2yrs', 'Revolving_log', 'age_log', '30Days_log',
       'DebtRatio_log', 'MonthlyIncome_log', 'OpenCredit_log', '60Days_log',
       '90Days_log', 'MonthlyDebt', 'MonthlyBalance'],
      dtype='object')
In [131]:
# Train and test split on the new dataframe with our top selection.
XFinal = FinalDataDF.drop(['SeriousDlqin2yrs'],axis=1,inplace=False)
yFinal = FinalDataDF['SeriousDlqin2yrs']
Xtrain, Xtest, Ytrain, Ytest = train_test_split(XFinal,yFinal, test_size=0.2, random_state=123)
In [132]:
# Train our final model
# We can increase the accuracy for true positives or negatives by increasing scale_pos_weight to handle the imbalanced class
XGB_Final_Model = XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               loss='deviance', max_delta_step=0, max_depth=3,
                               min_child_weight=1, missing=None,
                               n_estimators=160, n_jobs=1, nthread=None,
                               objective='binary:logistic', random_state=123,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=7,
                               seed=None, silent=None, subsample=0.9,
                               verbosity=1)
XGB_Final_Model.fit(XFinal, yFinal)
Out[132]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, loss='deviance', max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=160, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=123,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=7, seed=None,
              silent=None, subsample=0.9, verbosity=1)
In [133]:
# confusion Matrix
# Function for confusion matrix, normalized confusion matrix & sklearn classification report 
from sklearn.metrics import confusion_matrix
def ConfusionMatrix(yTrue, yPred, title = 'Confusion Matrix', cmap=plt.cm.Blues):
    print(classification_report(yTrue, yPred));
    cm = confusion_matrix(yTrue, yPred)
    def ConfusionMatrixPlot(cm, title = 'Confusion Matrix', cmap=plt.cm.Blues):
        ax= plt.subplot()
        sns.heatmap(cm, annot=True, ax = ax, cmap="Blues");
        # labels, title and ticks
        ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
        ax.set_title('Confusion Matrix'); 
        # Text Position
        b, t = plt.ylim()
        b += 0.4 
        t -= 0.4
        plt.ylim(b, t)
    
    print (cm)
    ConfusionMatrixPlot(cm=cm) 
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    print('\n Normalized confusion matrix \n')
    print(cm_normalized)
    plt.figure()
    ConfusionMatrixPlot(cm_normalized, title='Normalized Confusion Matrix ')
In [134]:
# Predictions of final model on test set
yPredFinal = XGB_Final_Model.predict(Xtest)
ConfusionMatrix(Ytest, yPredFinal)
              precision    recall  f1-score   support

           0       0.97      0.90      0.94     27836
           1       0.33      0.63      0.43      2042

    accuracy                           0.89     29878
   macro avg       0.65      0.77      0.68     29878
weighted avg       0.93      0.89      0.90     29878

[[25162  2674]
 [  749  1293]]

 Normalized confusion matrix 

[[0.90393735 0.09606265]
 [0.36679726 0.63320274]]

Comment:
The matrix above shows that predictions for the smaller class are now much better (higher recall).
The scale_pos_weight parameter performed much better than before and dealt with the class imbalance.
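The default-class metrics can be recovered directly from the confusion matrix counts printed above, which makes the recall improvement explicit:

```python
# Counts from the confusion matrix above: rows are true labels (0, 1),
# columns are predicted labels (0, 1).
tn, fp = 25162, 2674
fn, tp = 749, 1293

recall = tp / (tp + fn)        # fraction of actual defaulters caught
precision = tp / (tp + fp)     # fraction of flagged borrowers who default
print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.63, precision=0.33
```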

In [158]:
sns.set(rc={'figure.figsize':(6,6)})
# Plotting ROC curve with AUC display
y_pred_proba = XGB_Final_Model.predict_proba(Xtest)[::,1]



fpr, tpr, _ = metrics.roc_curve(Ytest,  y_pred_proba)
auc = metrics.roc_auc_score(Ytest, y_pred_proba)
plt.plot(fpr,tpr,label="AUC = %.2f"  % (auc))
plt.legend(loc=5)
plt.show()

Comment:
The AUC is 0.87, which is high, so the model ranks defaulters well. The ROC AUC metric does not depend on a strict discrete classification but on the predicted class probabilities, which is why it is more applicable to our type of problem.
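A toy illustration of why we prefer AUC here: with a roughly 93:7 class split like ours, a classifier that always predicts "no default" scores ~93% accuracy yet has no discriminative power at all (synthetic labels below, not our dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with ~7% positives, mimicking our class ratio.
rng = np.random.RandomState(0)
y_true = (rng.rand(10000) < 0.07).astype(int)

majority_pred = np.zeros(10000)  # always predict the majority class 0
print(accuracy_score(y_true, majority_pred))  # ~0.93: looks good...
print(roc_auc_score(y_true, majority_pred))   # 0.5: no ranking power
```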

Explainability

Tree Plot Explainer

In [138]:
import xgboost as xgb
xgb.plot_tree(XGB_Final_Model, num_trees=0)
fig = plt.gcf()
fig.set_size_inches(100, 150)
fig.savefig('tree1.png')

Comment:
The above is a tree plot of the first tree (num_trees=0) of our final XGB model. It is great for explaining a prediction by following the split logic, though plotting all 160 trees would be unreadable.

Shap Explainers

Model explainability is very important in data science today. SHAP is a solid library for providing such explanations. We will use it to explain our final XGB model and how it uses features when predicting.

In [140]:
# Shap explainers
explainerXGB = shap.TreeExplainer(XGB_Final_Model)
shap_values_XGB_test = explainerXGB.shap_values(Xtest)
shap_values_XGB_train = explainerXGB.shap_values(Xtrain)
df_shap_XGB_test = pd.DataFrame(shap_values_XGB_test, columns=Xtest.columns.values)
df_shap_XGB_train = pd.DataFrame(shap_values_XGB_train, columns=Xtrain.columns.values)
In [141]:
j = 0
# initialize js for SHAP
shap.initjs()
# force plot 
shap.force_plot(explainerXGB.expected_value, shap_values_XGB_test[j], Xtest.iloc[[j]])
Out[141]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Comment:
The graph above is called a force plot. It shows the features pushing the prediction away from the base value, which is the average model output over the training dataset we passed. Features pushing the prediction higher are shown in red; features pushing it lower appear in blue. The record we inspect from the test set has a lower than average predicted value, -1.52 compared to -0.78; $MonthlyIncome_log$ is 8 for this record, which pushes the prediction lower. These details explain why an individual prediction was made at a local level, which helps us include recommendations alongside other decision factors.
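For a binary XGBoost model, TreeExplainer works in log-odds, so the base value and the record's output shown in the force plot can be turned into probabilities with the sigmoid. A small sketch using the two values quoted above (-0.78 and -1.52):

```python
import numpy as np

def sigmoid(x):
    # Maps a log-odds value to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

base_value, record_value = -0.78, -1.52
print(f"average output as probability: {sigmoid(base_value):.2f}")  # ~0.31
print(f"this record as probability:    {sigmoid(record_value):.2f}")  # ~0.18
```

Note these are the model's (re-weighted) outputs, not calibrated default probabilities, since we trained with scale_pos_weight=7.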

In [143]:
# Variable importance graphs
# SHAP provides a theoretically sound method for evaluating variable importance
shap.summary_plot(shap_values_XGB_train, Xtrain, plot_type="bar")
In [144]:
shap.summary_plot(shap_values_XGB_train, Xtrain)

Comment:
Above is a summary plot showing the SHAP values for every instance from the training dataset, which can lead to a better understanding of overall patterns.

PCA and t-SNE Visualization

PCA

In [145]:
X = FinalDataDF.drop(['SeriousDlqin2yrs'],axis=1,inplace=False)
y = FinalDataDF['SeriousDlqin2yrs']
sns.set(rc={'figure.figsize':(12,12)})
sns.set_style("whitegrid")
In [146]:
# PCA n_components=2
pca = PCA(n_components=2)
#  Scaling the data
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
XTransformed = scaler.fit_transform(X)
pcaDF = pd.DataFrame(XTransformed)
pcaFit = pca.fit_transform(pcaDF)
result = pd.DataFrame(pcaFit, columns=['PCA1', 'PCA2'])
result['class'] = y
In [147]:
ax = sns.lmplot(data = result, x='PCA1', y='PCA2',hue='class', height=10, aspect=4,  fit_reg=False, palette='Set1',
               scatter_kws={'alpha': 0.2})
fig = plt.gcf()
fig.set_size_inches(12,6)
In [148]:
pca.explained_variance_ratio_
Out[148]:
array([0.29730182, 0.22030852])

Comment:
In the PCA plot there is significant overlap between the classes; however, defaults concentrate near the line PCA2 = 0 and in the region where PCA2 runs from 0 to 5 and PCA1 from 0 to -2. We also tried making predictions on the PCA-transformed data but obtained much lower scores, so we will not use PCA in our analysis. The explained variance ratios are as follows: the first component explains 30% of the variance and the second component 22%, for a total of 52%.
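The explained-variance ratios reported by `pca.explained_variance_ratio_` are simply each component's eigenvalue divided by the sum of all eigenvalues of the covariance matrix. A minimal sketch on toy standardized data with one inflated direction (the data and scale factor are illustrative, not the loan dataset):

```python
import numpy as np

# Toy data: 5 standardized features, one axis with doubled spread.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
X[:, 0] *= 2.0  # inflate variance along one axis

# PCA directions are eigenvectors of the covariance matrix; the
# explained-variance ratio is each eigenvalue over the eigenvalue sum.
cov = np.cov(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
ratios = eigvals / eigvals.sum()
print(ratios[:2], ratios[:2].sum())
```

With only 52% of the variance captured by two components here, a 2-D scatter necessarily discards almost half of the structure, which is consistent with the heavy class overlap seen in the plot.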

t-SNE

In [150]:
sns.set(rc={'figure.figsize':(12,12)})
sns.set_style("whitegrid")
tsne = TSNE(early_exaggeration=7.0, learning_rate=100.0, n_iter=250, metric='cosine', perplexity=20, verbose=2)
n = 75000
Xs, Xn, ys, yn = train_test_split(X, y, test_size=(1-(n/float(len(FinalDataDF)))))
tsne_transformed = tsne.fit_transform(Xs)
tsneDF = pd.DataFrame(tsne_transformed, columns=['1', '2'])
ys.index = range(0,len(tsneDF))
tsneDF['class'] = ys
sns.lmplot(x='1', y='2',data=tsneDF,hue='class',height=20,aspect=4,fit_reg=False,palette="Set1",scatter_kws={'alpha':0.7})
[t-SNE] Computing 61 nearest neighbors...
[t-SNE] Indexed 75000 samples in 0.002s...
[t-SNE] Computed neighbors for 75000 samples in 115.495s...
[t-SNE] Computed conditional probabilities for sample 1000 / 75000
...
[t-SNE] Computed conditional probabilities for sample 75000 / 75000
[t-SNE] Mean sigma: 0.002199
[t-SNE] Computed conditional probabilities in 3.973s
[t-SNE] Iteration 50: error = 69.6644745, gradient norm = 0.0000005 (50 iterations in 105.730s)
[t-SNE] Iteration 100: error = 69.6634750, gradient norm = 0.0001202 (50 iterations in 180.395s)
[t-SNE] Iteration 150: error = 61.4262505, gradient norm = 0.0020427 (50 iterations in 179.979s)
[t-SNE] Iteration 200: error = 55.7663383, gradient norm = 0.0011480 (50 iterations in 134.132s)
[t-SNE] Iteration 250: error = 52.8211098, gradient norm = 0.0007951 (50 iterations in 122.913s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 52.821110
[t-SNE] KL divergence after 251 iterations: 1.7976931348623157e+308 (float overflow: the embedding did not converge within the 250 iterations allowed)
Out[150]:
<seaborn.axisgrid.FacetGrid at 0x1bdeb356348>

Comment:
In the t-SNE plot, there is a high concentration of defaults on some of the rings on the right side.
Even so, PCA and t-SNE did not help us answer our questions, but they were useful for understanding the structure of our dataset.
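Because t-SNE is expensive, the cell above subsamples the data by deriving a `test_size` that leaves exactly `n` rows on the "train" side of `train_test_split`. A minimal sketch of that arithmetic, assuming a hypothetical total of 120,000 rows in place of `len(FinalDataDF)`:

```python
# Derive the test_size that keeps exactly n rows for t-SNE.
# total = 120000 is a hypothetical stand-in for len(FinalDataDF).
n, total = 75000, 120000
test_size = 1 - n / float(total)

# train_test_split returns (1 - test_size) * total rows as Xs/ys,
# so the t-SNE sample has exactly n rows.
kept = round((1 - test_size) * total)
print(test_size, kept)  # 0.375 75000
```

Note that `train_test_split` also accepts an integer `train_size=n` directly, which avoids the floating-point arithmetic altogether.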



Markdown word count: 1509

The End